Minimum Message Length Clustering Using Gibbs Sampling
Abstract
The K-Means and EM algorithms are popular in clustering and mixture modeling due to their simplicity and ease of implementation. However, they have several significant limitations. Both converge to a local optimum of their respective objective functions (ignoring the uncertainty in the model space), require the a priori specification of the number of classes/clusters, and are inconsistent. In this work we overcome these limitations by using the Minimum Message Length (MML) principle and a variation on the K-Means/EM observation assignment and parameter calculation scheme. We maintain the simplicity of these approaches while constructing a Bayesian mixture modeling tool that samples/searches the model space using a Markov Chain Monte Carlo (MCMC) sampler known as a Gibbs sampler. Gibbs sampling allows us to visit each model according to its posterior probability. Therefore, if the model space is multi-modal, we will visit all modes and not get stuck in local optima. We call our approach multiple chains at equilibrium (MCE) MML sampling.
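To make the contrast with K-Means/EM concrete, the following is a minimal sketch of a generic Gibbs sampler for a one-dimensional Gaussian mixture. It is not the MCE MML sampler described in the abstract: the number of components `k` is fixed, the mixing weights are assumed equal, the variance `sigma2` is assumed known and shared, and each mean has an assumed N(`mu0`, `tau2`) prior. The function name `gibbs_mixture` and all parameter names are illustrative. The point it shows is only the change in the two familiar steps: assignments and parameters are sampled from their conditional posteriors rather than set to their best values.

```python
import math
import random


def gibbs_mixture(data, k, sweeps=200, sigma2=1.0, mu0=0.0, tau2=100.0, seed=0):
    """Gibbs sampler for a 1-D Gaussian mixture with k components.

    Simplifying assumptions (not from the paper): equal mixing weights,
    a known shared variance sigma2, and a N(mu0, tau2) prior on each mean.
    Returns one sorted list of component means per sweep.
    """
    rng = random.Random(seed)
    means = rng.sample(list(data), k)  # initialise means at random data points
    labels = [0] * len(data)
    trace = []
    for _ in range(sweeps):
        # Assignment step: sample each label from its conditional posterior.
        # K-Means would take the most probable component here instead.
        for i, x in enumerate(data):
            logw = [-(x - m) ** 2 / (2.0 * sigma2) for m in means]
            top = max(logw)  # subtract the max for numerical stability
            weights = [math.exp(lw - top) for lw in logw]
            labels[i] = rng.choices(range(k), weights=weights)[0]
        # Parameter step: sample each mean from its conjugate Normal posterior.
        # K-Means/EM would plug in a point estimate here instead.
        for j in range(k):
            members = [x for x, z in zip(data, labels) if z == j]
            post_var = 1.0 / (1.0 / tau2 + len(members) / sigma2)
            post_mean = post_var * (mu0 / tau2 + sum(members) / sigma2)
            means[j] = rng.gauss(post_mean, math.sqrt(post_var))
        trace.append(sorted(means))  # sorting sidesteps label switching
    return trace


if __name__ == "__main__":
    gen = random.Random(1)
    sample = [gen.gauss(-5.0, 1.0) for _ in range(100)]
    sample += [gen.gauss(5.0, 1.0) for _ in range(100)]
    draws = gibbs_mixture(sample, k=2)[100:]  # discard the first 100 sweeps
    print([sum(d[j] for d in draws) / len(draws) for j in range(2)])
```

Averaging the retained draws gives posterior mean estimates for the component means, and the spread of the draws reflects the parameter uncertainty that a single K-Means or EM run does not report.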
Similar resources
Intrinsic Classification of Spatially Correlated Data
Intrinsic classification, or unsupervised learning of a classification, was the earliest application of what is now termed minimum message length (MML) or minimum description length (MDL) inference. The MML algorithm ‘Snob’ and its relatives have been used successfully in many domains. These algorithms treat the ‘things’ to be classified as independent random selections from an unknown populati...
Modeling Transcription Factor Binding Sites with Gibbs Sampling and Minimum Description Length Encoding
Transcription factors, proteins required for the regulation of gene expression, recognize and bind short stretches of DNA on the order of 4 to 10 bases in length. In general, each factor recognizes a family of "similar" sequences rather than a single unique sequence. Ultimately, the transcriptional state of a gene is determined by the cooperative interaction of several bound factors. We have de...
Simultaneous alignment and clustering of peptide data using a Gibbs sampling approach
MOTIVATION: Proteins recognizing short peptide fragments play a central role in cellular signaling. As a result of high-throughput technologies, peptide-binding protein specificities can be studied using large peptide libraries at dramatically lower cost and time. Interpretation of such large peptide datasets, however, is a complex task, especially when the data contain multiple receptor binding...
Identification of Minimum Redundancy Tagging SNPs via Gibbs Sampling
Single nucleotide polymorphisms (SNPs) are genetic changes that can occur within a DNA sequence. Due to the high frequency of SNPs in the human genome, it is desirable to select a small set of SNPs (tagging SNPs) that can be used to represent the majority of SNPs. We propose a Gibbs sampling approach to find a small set of SNPs with minimum redundancy for tagging purposes. Preclustering is adde...
Clustering using the Minimum Message Length Criterion and Simulated Annealing
Clustering has many uses such as the generation of taxonomies and concept formation. It is essentially a search through a model space to maximise a given criterion. The criterion aims to guide the search to find models that are suitable for a purpose. The search’s aim is to efficiently and consistently find the model that gives the optimal criterion value. Considerable research has occurred int...